Show the code
pacman::p_load(jsonlite, tidygraph, ggraph, visNetwork, graphlayouts, ggforce,skimr,tidytext, tidyverse,igraph, topicmodels,tm, stringr, ldatuning)Uncover illegal, unreported, and unregulated (IUU) fishing activities through visual analytics
Using the data provide by the VAST challenge, we are looking into the Mini-Challenge 3 (MC3) to identify compaines possibly engaged in illegal, unreported, and unregulated (IUU) fishing.
pacman::p_load(jsonlite, tidygraph, ggraph, visNetwork, graphlayouts, ggforce,skimr,tidytext, tidyverse,igraph, topicmodels,tm, stringr, ldatuning)In the code chunk below, fromJSON() of jsonlite package is used to import MC3.json into R environment.
mc3_data <- fromJSON("data/MC3.json")Examine the data, this is not a directed graph, not looking into in- and out-degree of the nodes.
Below code chunk changes the links field into character field.
mc3_edges <- as_tibble(mc3_data$links)%>%
distinct() %>%
mutate(source = as.character(source),
target = as.character(target),
type = as.character(type)) %>%
group_by(source, target, type) %>%
summarise(weights = n()) %>%
filter(source!=target)%>%
ungroupmc3_nodes <- as_tibble(mc3_data$nodes) %>%
# distinct()%>%
mutate(country = as.character(country),
id = as.character(id),
product_services = as.character(product_services),
revenue_omu = as.numeric(as.character(revenue_omu)),
type = as.character(type)) %>%
select(id, country, type, revenue_omu, product_services)In the code chunk below, skim() of skimr package is used to display the summary statistics of mc3_edges tibble data frame.
skim(mc3_edges)| Name | mc3_edges |
| Number of rows | 24036 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 6 | 700 | 0 | 12856 | 0 |
| target | 0 | 1 | 6 | 28 | 0 | 21265 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weights | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
The report above reveals that there is not missing values in all fields.
In the code chunk below, datatable() of DT package is used to display mc3_edges tibble data frame as an interactive table on the html document.
DT::datatable(mc3_edges)Below code chunks, counting number of companies a person owns.
ggplot(data = mc3_edges,
aes(x=type)) +
geom_bar()
unique_ids <- unique(mc3_edges$target)
num_unique_ids <- length(unique_ids)
num_unique_ids[1] 21265
Noofcompanies <- mc3_edges %>%
group_by(target, source, type) %>%
filter(type == "Beneficial Owner") %>%
summarise(count=n()) %>%
group_by(target)%>%
summarise(count=sum(count))
psych::describe(Noofcompanies) vars n mean sd median trimmed mad min max range skew
target* 1 15305 7653.0 4418.32 7653 7653 5672.43 1 15305 15304 0.00
count 2 15305 1.1 0.40 1 1 0.00 1 9 8 6.28
kurtosis se
target* -1.20 35.71
count 61.69 0.00
Below code chunk summarizes the numbers of owners to the company.
Noofowners <- mc3_edges %>%
group_by(source, target, type) %>%
summarise(count=n()) %>%
group_by(source)%>%
summarise(count=sum(count))
psych::describe(Noofowners) vars n mean sd median trimmed mad min max range skew
source* 1 12856 6428.50 3711.35 6428.5 6428.50 4765.08 1 12856 12855 0.00
count 2 12856 1.87 3.47 1.0 1.22 0.00 1 120 119 11.36
kurtosis se
source* -1.20 32.73
count 215.82 0.03
Below code chunk we are interested to see top 50 owners owning multiple companies, with John Smith and Michael Johnson have the highest of 9 companies to their name. This could be suspicious as why they need so many companies.
list_top_50 <- Noofcompanies %>%
arrange(desc(count)) %>%
top_n(50, wt = count)
ggplot(data = list_top_50,
aes(x = reorder(target, -count), y = count)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) 
skim(mc3_nodes)| Name | mc3_nodes |
| Number of rows | 27622 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 6 | 64 | 0 | 22929 | 0 |
| country | 0 | 1 | 2 | 15 | 0 | 100 | 0 |
| type | 0 | 1 | 7 | 16 | 0 | 3 | 0 |
| product_services | 0 | 1 | 4 | 1737 | 0 | 3244 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 21515 | 0.22 | 1822155 | 18184433 | 3652.23 | 7676.36 | 16210.68 | 48327.66 | 310612303 | ▇▁▁▁▁ |
The report above reveals that there is no missing values in all fields.
In the code chunk below, datatable() of DT package is used to display mc3_nodes tibble data frame as an interactive table on the html document.
DT::datatable(mc3_nodes)Below code chunk to find out how is the distribution among the types of ownerhships.
ggplot(data = mc3_nodes,
aes(x = type)) +
geom_bar()
Below code chunk we check on the revenue distribution among the types of ownerships.
ggplot(data = mc3_nodes,
aes(x= type,
y = revenue_omu)) +
geom_boxplot()
We combined the nodes and edges data so we can find out more on the owner-company relationships.
combined <- left_join(mc3_nodes,mc3_edges,
by=c("id"="source"))Below code chunk to find out more on which owners have high number of companies also generating a lot of revenue.
combined <- combined %>%
group_by(target, type.y, id, country, type.x, product_services)%>%
summarize(revenue_omu) %>%
filter(type.y == "Beneficial Owner")
filtered_combined <- combined %>%
filter(target %in% list_top_50$target)%>%
arrange(desc(revenue_omu))
ggplot(data = filtered_combined,
aes(x = target, y = revenue_omu)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) 
Michael Johnson, Mark Miller and James Rodriguez stand out from the above chart, below code we want to see what business they did that generate more revenue.
Top_3_Revenue<- combined %>%
filter (target %in% c("Michael Johnson", "Mark Miller","James Rodriguez")) %>%
arrange(desc(revenue_omu))
DT::datatable(Top_3_Revenue)From the above data table, we see that Michael Johnson is involved in the fishing business and having many companies in different countries. The FishEye could probably look more into his business landscape across different companies and his business activities to understand more.
The code chunk below calculates number of times the word fish appeared in the field product_services.
mc3_nodes %>%
mutate(n_fish = str_count(product_services, "fish")) # A tibble: 27,622 × 6
id country type revenue_omu product_services n_fish
<chr> <chr> <chr> <dbl> <chr> <int>
1 Jones LLC ZH Comp… 310612303. Automobiles 0
2 Coleman, Hall and Lopez ZH Comp… 162734684. Passenger cars,… 0
3 Aqua Advancements Sashimi … Oceanus Comp… 115004667. Holding firm wh… 0
4 Makumba Ltd. Liability Co Utopor… Comp… 90986413. Car service, ca… 0
5 Taylor, Taylor and Farrell ZH Comp… 81466667. Fully electric … 0
6 Harmon, Edwards and Bates ZH Comp… 75070435. Discount superm… 0
7 Punjab s Marine conservati… Riodel… Comp… 72167572. Beef, pork, chi… 0
8 Assam Limited Liability … Utopor… Comp… 72162317. Power and Gas s… 0
9 Ianira Starfish Sagl Import Rio Is… Comp… 68832979. Light commercia… 0
10 Moran, Lewis and Jimenez ZH Comp… 65592906. Automobiles, tr… 0
# ℹ 27,612 more rows
The word tokenisation have different meaning in different scientific domains. In text sensing, tokenisation is the process of breaking up a given text into units called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenisation, some characters like punctuation marks may be discarded. The tokens usually become the input for the processes like parsing and text mining.
In the code chunk below, unnest_token() of tidytext is used to split text in product_services field into words.
token_nodes <- mc3_nodes %>%
unnest_tokens(word,
product_services)The two basic arguments to unnest_tokens() used here are column names. First we have the output column name that will be created as the text is unnested into it (word, in this case), and then the input column that the text comes from (product_services, in this case).
token_nodes %>%
count(word, sort = TRUE) %>%
top_n(15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(x = "Count",
y = "Unique words",
title = "Count of unique words found in product_services field")
The bar chart reveals that the unique words contains some words that may not be useful to use. For instance “a” and “to”. In the word of text mining we call those words stop words. You want to remove these words from your analysis as they are fillers used to compose a sentence.
Using filter we also discover many “character(0)” which has no meaning in itself, we will also proceed to replace them with “NA”.
token_nodes$word[token_nodes$word == "character"] <- "NA"
token_nodes$word[token_nodes$word == "0"] <- "NA"stopwords_removed <- token_nodes %>%
anti_join(stop_words)stopwords_removed %>%
count(word, sort = TRUE) %>%
top_n(15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(x = "Count",
y = "Unique words",
title = "Count of unique words found in product_services field")
Below code removed the top 3 irrelevant words.
stopwords_removed %>%
count(word, sort = TRUE) %>%
top_n(20) %>%
mutate(word = reorder(word, n)) %>%
filter(!word %in% head(word, 3)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(x = "Count",
y = "Unique words",
title = "Count of unique words found in product_services field")
This section, we will conduct topic modelling to find out the similar business groups. For this, we will make use of Latent Dirichlet Allocation (LDA).
First, run the below code chunk to remove the irrelevant top 3 keywords: “NA”, “unknown”, “products”.
stopwords_removed <- stopwords_removed %>%
filter(!word %in% c("NA", "unknown", "products"))
dim(stopwords_removed)[1] 44082 5
We need to convert the document to a document-term matrix to to run LDA.
# using as.matrix()
dtm <- stopwords_removed %>%
count(id, word) %>% # count each word used in each identified review
cast_dtm(id, word, n) %>% # use the word counts by reviews to create a DTM
as.matrix()# create models with different number of topics
result <- ldatuning::FindTopicsNumber(
dtm,
topics = seq(from = 2, to = 20, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
verbose = TRUE
)fit models... done.
calculate metrics:
Griffiths2004... done.
CaoJuan2009... done.
Arun2010... done.
Deveaud2014... done.
FindTopicsNumber_plot(result)
From the above graph, we see that after topic 12, for “Griffiths2004”, there isn’t much increase in the value, hence, we choose topics to be 12.
# number of topics
K <- 12
# set random number generator seed
set.seed(1234)
# compute the LDA model, inference via 1000 iterations of Gibbs sampling
topicModel <- LDA(dtm, K, method="Gibbs", control=list(iter = 500, verbose = 25))K = 12; V = 7746; M = 3894
Sampling 500 iterations!
Iteration 25 ...
Iteration 50 ...
Iteration 75 ...
Iteration 100 ...
Iteration 125 ...
Iteration 150 ...
Iteration 175 ...
Iteration 200 ...
Iteration 225 ...
Iteration 250 ...
Iteration 275 ...
Iteration 300 ...
Iteration 325 ...
Iteration 350 ...
Iteration 375 ...
Iteration 400 ...
Iteration 425 ...
Iteration 450 ...
Iteration 475 ...
Iteration 500 ...
Gibbs sampling completed!
lda_topics <- topicModel %>%
tidy(matrix = "beta")lda_topics <- LDA(
dtm,
k = 12,
method = "Gibbs",
control = list(seed=42)
) %>%
tidy(matrix = "beta")
word_probs <- lda_topics %>%
group_by(topic) %>%
top_n(15, beta) %>%
ungroup() %>%
mutate(term2 = fct_reorder(term, beta))
ggplot(
word_probs,
aes(term2, beta, fill=as.factor(topic))
) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
From the above result, we can see topic 5 and topic 10 are most related to fishing industry. Topic 5 is a group of fishing business likely to have fresh sea products, while topic 10 is a group of fishing business likely to have more processed sea products. In the following network analysis session, we will focus on the nodes that contains topic 5 and topic 10 keywords.
From the above text insights, we are interested to see the network of companies of Beneficial Owners with topic 5 as their product services.
mc3_nodes_topic5 <- stopwords_removed %>%
filter(stopwords_removed$word %in% c("fish","salmon","tuna","shrimp","squid","frozen","sea","cod","fillets","crabs","fillet","pollock","lobster","smoked"))
mc3_nodes_topic10 <- stopwords_removed %>%
filter(stopwords_removed$word %in% c("seafood","fish","fresh","processing","seafoods","marine","shellfish","frozen","aquatic","oils","canning","packing","fats","cured","preparation"))mc3_edges_topic5 <- mc3_edges[mc3_edges$source %in% mc3_nodes_topic5$id,] %>%
filter(type == "Beneficial Owner")
id1 <- mc3_edges_topic5 %>%
select(source) %>%
rename(id = source)
id2 <- mc3_edges_topic5 %>%
select(target) %>%
rename(id = target)
mc3_nodes_topic5 <- rbind(id1, id2) %>%
distinct() %>%
left_join(mc3_nodes_topic5,
unmatched = "drop")
mc3_edges_topic10 <- mc3_edges[mc3_edges$source %in% mc3_nodes_topic10$id,] %>%
filter(type == "Beneficial Owner")
id1 <- mc3_edges_topic10 %>%
select(source) %>%
rename(id = source)
id2 <- mc3_edges_topic10 %>%
select(target) %>%
rename(id = target)
mc3_nodes_topic10 <- rbind(id1, id2) %>%
distinct() %>%
left_join(mc3_nodes_topic10,
unmatched = "drop") mc3_graph_topic5 <- tbl_graph(nodes = mc3_nodes_topic5, edges = mc3_edges_topic5, directed = FALSE)
mc3_graph_topic5 <-mc3_graph_topic5 %>%
mutate(betweenness=centrality_betweenness())
mc3_graph_topic10 <- tbl_graph(nodes = mc3_nodes_topic10, edges = mc3_edges_topic10, directed = FALSE)
mc3_graph_topic10 <-mc3_graph_topic10 %>%
mutate(betweenness=centrality_betweenness()) Using the distribution function to understand the centrality_betweenness() for topic 5 and 10 graphs.
ggplot(as.data.frame(mc3_graph_topic5),aes(x=betweenness))+
geom_histogram(bins=10,fill="lightblue",colour="black")+
ggtitle("Distribution of centrality betweenness for topic 5")+
theme(plot.title = element_text(hjust=0.5))
ggplot(as.data.frame(mc3_graph_topic10),aes(x=betweenness))+
geom_histogram(bins=10,fill="lightblue",colour="black")+
ggtitle("Distribution of centrality betweenness for topic 10")+
theme(plot.title = element_text(hjust=0.5))
Looking at this, we can filter our records where the centrality between is greater than 75 to understand the interactions.
set.seed (1234)
degrees <- degree(mc3_graph_topic5)
V(mc3_graph_topic5)$degree <- degrees
mc3_graph_topic5 %>%
filter(betweenness >= 75) %>%
ggraph(layout = "fr") +
geom_edge_link(aes(alpha=0.5)) +
geom_node_point(aes(
size = betweenness,
colors = "lightblue",
alpha = 0.5)) +
scale_size_continuous(range=c(1,10))+
geom_node_text(aes(label = id, filter= betweenness >=200 °ree >0), repel = TRUE)+
theme_graph()
set.seed (1234)
degrees <- degree(mc3_graph_topic10)
V(mc3_graph_topic10)$degree <- degrees
mc3_graph_topic10 %>%
filter(betweenness >= 75) %>%
ggraph(layout = "fr") +
geom_edge_link(aes(alpha=0.5)) +
geom_node_point(aes(
size = betweenness,
colors = "lightblue",
alpha = 0.5)) +
scale_size_continuous(range=c(1,10))+
geom_node_text(aes(label = id, filter= betweenness >=200 °ree >0), repel = TRUE)+
theme_graph()
Looking at the above two network graph, we see there are some overlapping companies in these two topic groups. Upcoming work will look into the similarities between these two groups.